Abstract

We introduce a model for constructing vector representations of words by composing characters using bidirectional LSTMs. Relative to traditional word representation models that have independent vectors for each word type, our model requires only a single vector per character type and a fixed set of parameters for the compositional model. Despite the compactness of this model and, more importantly, the arbitrary nature of the form-function relationship in language, our “composed” word representations yield state-of-the-art results in language modeling and part-of-speech tagging. Benefits over traditional baselines are particularly pronounced in morphologically rich languages (e.g., Turkish).

1 Introduction 

Good representations of words are important for good generalization in natural language processing applications. Of central importance are vector space models that capture functional (i.e., semantic and syntactic) similarity in terms of geometric locality. However, when word vectors are learned, a practice that is becoming increasingly common, most models assume that each word type has its own vector representation that can vary independently of other model components. This paper argues that this independence assumption is inherently problematic, in particular in morphologically rich languages (e.g., Turkish). In such languages, a more reasonable assumption would be that orthographic (formal) similarity is evidence for functional similarity.


However, it is manifestly clear that similarity in form is neither a necessary nor sufficient condition for similarity in function: small orthographic differences may correspond to large semantic or syntactic differences (butter vs. batter), and large orthographic differences may obscure nearly perfect functional correspondence (rich vs. affluent). Thus, any orthographically aware model must be able to capture non-compositional effects in addition to more regular effects due to, e.g., morphological processes. To model the complex form-function relationship, we turn to long short-term memories (LSTMs), which are designed to be able to capture complex non-linear and non-local dynamics in sequences (Hochreiter and Schmidhuber, 1997). We use bidirectional LSTMs to “read” the character sequences that constitute each word and combine them into a vector representation of the word. This model assumes that each character type is associated with a vector, and the LSTM parameters encode both idiosyncratic lexical and regular morphological knowledge.


To evaluate our model, we use a vector-based model for part-of-speech (POS) tagging and for language modeling, and we report experiments on these tasks in several languages, comparing to baselines that use more traditional, orthographically-unaware parameterizations. These experiments show: (i) our character-based model is able to generate similar representations for words that are semantically and syntactically similar, even for words that are orthographically distant (e.g., October and January); (ii) our model achieves improvements over word lookup tables using only a fraction of the number of parameters in two tasks; (iii) our model obtains state-of-the-art performance on POS tagging (including establishing a new best performance in English); and (iv) performance improvements are especially dramatic in morphologically rich languages.

The paper is organized as follows: Section 2 presents our character-based model to generate word embeddings. Experiments on language modeling and POS tagging are described in Sections 4 and 5. We present related work in Section 6, and we conclude in Section 7.

2 Word Vectors and Wordless Word 
Vectors 


It is commonplace to represent words as vectors. In contrast to naive models in which all word types in a vocabulary $V$ are equally different from each other, vector space models capture the intuition that words may be different or similar along a variety of dimensions. Learning vector representations of words by treating them as optimizable parameters in various kinds of language models has been found to be a remarkably effective means for generating vector representations that perform well in other tasks (Collobert et al., 2011; Kalchbrenner and Blunsom, 2013; Liu et al., 2014; Chen and Manning, 2014). Formally, such models define a matrix $P \in \mathbb{R}^{d \times |V|}$, which contains $d$ parameters for each word in the vocabulary $V$. For a given word type $w \in V$, a column is selected by right-multiplying $P$ by a one-hot vector of length $|V|$, which we write $\mathbf{1}_w$, that is zero in every dimension except for the element corresponding to $w$. Thus, $P$ is often referred to as a word lookup table, and we shall denote by $e^W_w \in \mathbb{R}^d$ the embedding obtained from a word lookup table for $w$ as $e^W_w = P \cdot \mathbf{1}_w$. This allows tasks with low amounts of annotated data to be trained jointly with other tasks with large amounts of data and leverage the similarities in these tasks. A common practice to this end is to initialize the word lookup table with the parameters trained on an unsupervised task. Some examples of these include the skip-n-gram and CBOW models of Mikolov et al. (2013).
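The column selection described above can be illustrated with a small sketch. The vocabulary and the values in `P` are invented for illustration only; the point is that multiplying by a one-hot vector and indexing a column give the same embedding.

```python
# Toy lookup table with d = 3 rows and |V| = 4 columns (values are made up).
vocab = ["cat", "cats", "king", "kings"]
P = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

def one_hot(word, vocab):
    """Vector of length |V| that is 1 at the index of `word` and 0 elsewhere."""
    return [1.0 if v == word else 0.0 for v in vocab]

def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lookup(word, vocab, matrix):
    """Select the column for `word` directly, without forming the product."""
    j = vocab.index(word)
    return [row[j] for row in matrix]

# e_w = P . 1_w, computed as an explicit product with the one-hot vector.
embedding = matvec(P, one_hot("king", vocab))
```

In practice the product is never formed explicitly; indexing the column, as `lookup` does, is equivalent and much cheaper.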


2.1 Problem: Independent Parameters 

There are two practical problems with word lookup tables. First, while they can be pre-trained with large amounts of data to learn semantic and syntactic similarities between words, each vector is independent. That is, even though models based on word lookup tables are often observed to learn that cats, kings and queens exist in roughly the same linear correspondences to each other as cat, king and queen do, the model does not represent the fact that adding an s at the end of the word is evidence for this transformation. This means that word lookup tables cannot generate representations for previously unseen words, such as Frenchification, even if the components, French and -ification, are observed in other contexts.

Second, even if copious data is available, it is impractical to actually store vectors for all word types. As each word type gets a set of $d$ parameters, the total number of parameters is $d \times |V|$, where $|V|$ is the size of the vocabulary. Even in relatively morphologically poor English, the number of word types tends to scale to the order of hundreds of thousands, and in noisier domains, such as online data, the number of word types rises considerably. For instance, in the English Wikipedia dump with 60 million sentences, there are approximately 20 million different lowercased and tokenized word types, each of which would need its own vector. Intuitively, it is not sensible to use the same number of parameters for each word type.

Finally, it is important to remark that it is uncontroversial among cognitive scientists that our lexicon is structured into related forms, i.e., their parameters are not independent. The well-known “past tense debate” between connectionists and proponents of symbolic accounts concerns disagreements about how humans represent knowledge of inflectional processes (e.g., the formation of the English past tense), not whether such knowledge exists (Marslen-Wilson and Tyler, 1998).


2.2 Solution: Compositional Models 


Our solution to these problems is to construct a vector representation of a word by composing smaller pieces into a representation of the larger form. This idea has been explored in prior work by composing morphemes into representations of words (Luong et al., 2013; Botha and Blunsom, 2014; Soricut and Och, 2015). Morphemes are an ideal primitive for such a model since they are, by definition, the minimal meaning-bearing (or syntax-bearing) units of language. The drawback to such approaches is that they depend on a morphological analyzer.

In contrast, we would like to compose representations of characters into representations of words. However, the relationship between word forms and their meanings is non-trivial (de Saussure, 1916). While some compositional relationships exist, e.g., morphological processes such as adding -ing or -ly to a stem have relatively regular effects, many words with lexical similarities convey different meanings, such as the word pairs lesson ⇔ lessen and coarse ⇔ course.


3 C2W Model 


Our compositional character to word (C2W) model is based on bidirectional LSTMs (Graves and Schmidhuber, 2005), which are able to learn complex non-local dependencies in sequence models. An illustration is shown in Figure 1. The input of the C2W model (illustrated on bottom) is a single word type $w$, and what we wish to obtain is a $d$-dimensional vector used to represent $w$. This model shares the same input and output as a word lookup table (illustrated on top), allowing it to easily replace them in any network.

As input, we define an alphabet of characters $C$. For English, this vocabulary would contain an entry for each uppercase and lowercase letter as well as numbers and punctuation. The input word $w$ is decomposed into a sequence of characters $c_1, \ldots, c_m$, where $m$ is the length of $w$. Each $c_i$ is defined as a one-hot vector $\mathbf{1}_{c_i}$, with a one at the index of $c_i$ in the character vocabulary $C$. We define a projection layer $P_C \in \mathbb{R}^{d_C \times |C|}$, where $d_C$ is the number of parameters for each character in the character set $C$. This is of course just a character lookup table, and is used to capture similarities between characters in a language (e.g., vowels vs. consonants). Thus, we write the projection of each input character $c_i$ as $e_{c_i} = P_C \cdot \mathbf{1}_{c_i}$.

Given the input vectors $x_1, \ldots, x_m$, an LSTM computes the state sequence $h_1, \ldots, h_{m+1}$ by iteratively applying the following updates:

$$
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + W_{ic} c_{t-1} + b_i) \\
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + W_{fc} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c) \\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + W_{oc} c_t + b_o) \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

where $\sigma$ is the component-wise logistic sigmoid function, and $\odot$ is the component-wise (Hadamard) product. LSTMs define an extra cell memory $c_t$, which is combined linearly at each timestep $t$. The information that is propagated from $c_{t-1}$ to $c_t$ is controlled by the three gates $i_t$, $f_t$ and $o_t$, which determine what to include from the input $x_t$, what to forget from $c_{t-1}$ and what is relevant to the current state $h_t$. We write $\mathcal{W}$ to refer to all parameters of the LSTM ($W_{ix}$, $W_{fx}$, $b_f$, ...). Thus, given a sequence of character representations $e_{c_1}, \ldots, e_{c_m}$ as input, the forward LSTM yields the state sequence $s^f_0, \ldots, s^f_m$, while the backward LSTM receives as input the reverse sequence and yields states $s^b_m, \ldots, s^b_0$. The two LSTMs use different sets of parameters, $\mathcal{W}^f$ and $\mathcal{W}^b$. The representation of the word $w$ is obtained by combining the forward and backward states:

$$e^C_w = D^f s^f_m + D^b s^b_0 + b_d,$$

where $D^f$, $D^b$ and $b_d$ are parameters that determine how the states are combined.

Figure 1: Illustration of the word lookup tables (top) and the lexical Composition Model (bottom). Square boxes represent vectors of neuron activations. Shaded boxes indicate a non-linearity.
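To make the composition concrete, the following is a minimal pure-Python sketch of the forward pass described in this section: a character lookup, two LSTMs with the gate equations given earlier, and the final linear combination. It is not the authors' implementation; the tiny dimensions, the random initialization, the lowercase-only alphabet and all function names are assumptions made for illustration, and no training is performed.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vectors):
    """Component-wise sum of any number of equal-length vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_lstm(d_in, d_state, rng):
    """Random parameters for one LSTM, keyed after the update equations."""
    p = {}
    for gate in "ifco":
        p["W_" + gate + "x"] = rand_matrix(d_state, d_in, rng)
        p["W_" + gate + "h"] = rand_matrix(d_state, d_state, rng)
        p["b_" + gate] = [0.0] * d_state
    for gate in "ifo":  # connections from the cell memory to the gates
        p["W_" + gate + "c"] = rand_matrix(d_state, d_state, rng)
    return p

def affine(p, gate, x, h, c=None):
    """W_gx x + W_gh h (+ W_gc c) + b_g for the named gate."""
    terms = [matvec(p["W_" + gate + "x"], x), matvec(p["W_" + gate + "h"], h), p["b_" + gate]]
    if c is not None:
        terms.append(matvec(p["W_" + gate + "c"], c))
    return add(*terms)

def lstm_step(p, x, h_prev, c_prev):
    i = [sigmoid(v) for v in affine(p, "i", x, h_prev, c_prev)]
    f = [sigmoid(v) for v in affine(p, "f", x, h_prev, c_prev)]
    g = [math.tanh(v) for v in affine(p, "c", x, h_prev)]
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    o = [sigmoid(v) for v in affine(p, "o", x, h_prev, c)]  # output gate sees the new c_t
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c

def final_state(p, inputs, d_state):
    """Run the LSTM over `inputs` from zero states and return the last h."""
    h, c = [0.0] * d_state, [0.0] * d_state
    for x in inputs:
        h, c = lstm_step(p, x, h, c)
    return h

def c2w(word, char_table, fwd, bwd, D_f, D_b, b_d, d_state):
    chars = [char_table[ch] for ch in word]          # character lookup
    s_f = final_state(fwd, chars, d_state)           # forward pass over c_1..c_m
    s_b = final_state(bwd, chars[::-1], d_state)     # backward pass over c_m..c_1
    return add(matvec(D_f, s_f), matvec(D_b, s_b), b_d)

rng = random.Random(0)
d_c, d_cs, d = 4, 6, 5  # toy sizes, far smaller than those used in the experiments
char_table = {ch: [rng.uniform(-0.1, 0.1) for _ in range(d_c)] for ch in "abcdefghijklmnopqrstuvwxyz"}
fwd, bwd = init_lstm(d_c, d_cs, rng), init_lstm(d_c, d_cs, rng)
D_f, D_b = rand_matrix(d, d_cs, rng), rand_matrix(d, d_cs, rng)
b_d = [0.0] * d

e_cats = c2w("cats", char_table, fwd, bwd, D_f, D_b, b_d, d_cs)
e_nonce = c2w("phding", char_table, fwd, bwd, D_f, D_b, b_d, d_cs)  # unseen word still gets a vector
```

Because the vector depends only on the character sequence and the shared parameters, any string over the alphabet receives a representation, including words never seen in training.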

Caching for Efficiency. Relative to $e^W_w$, computing $e^C_w$ is computationally expensive, as it requires two LSTM traversals of length $m$. However, $e^C_w$ only depends on the character sequence of that word, which means that unless the parameters are updated, it is possible to cache the value of $e^C_w$ for each different $w$ that will be used repeatedly. Thus, the model can keep a list of the most frequently occurring word types in memory and run the compositional model only for rare words. Obviously, caching all words would yield the same performance as using a word lookup table $e^W_w$, but would also use the same amount of memory. Consequently, the number of word types held in the cache can be adjusted to satisfy the memory vs. performance requirements of a particular application.

At training time, when parameters are changing, 
repeated words within the same batch only need to 
be computed once, and the gradient at the output 
can be accumulated within the batch so that only 
one update needs to be done per word type. For 
this reason, it is preferable to define larger batches. 
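The caching strategy can be sketched as follows. This is an illustrative design rather than the paper's code: the class name, the `toy_compose` stand-in for the C2W model and the toy corpus are all hypothetical. Only the most frequent word types are kept in memory, and the cache must be cleared whenever parameters change.

```python
from collections import Counter

class CachedEmbedder:
    """Caches composed vectors for the most frequent word types only."""

    def __init__(self, compose, corpus_words, cache_size):
        self.compose = compose
        counts = Counter(corpus_words)
        self.cached_types = {w for w, _ in counts.most_common(cache_size)}
        self.cache = {}
        self.compose_calls = 0  # counts how often the expensive model is run

    def embed(self, word):
        if word in self.cache:
            return self.cache[word]
        self.compose_calls += 1
        vector = self.compose(word)
        if word in self.cached_types:  # rare words are recomputed on demand
            self.cache[word] = vector
        return vector

    def invalidate(self):
        """Call after every parameter update, since cached vectors become stale."""
        self.cache.clear()

def toy_compose(word):
    # Stand-in for the compositional model: any deterministic function of the characters.
    return [float(len(word)), float(sum(map(ord, word)) % 7)]

corpus = "the cat sat on the mat the cat".split()
embedder = CachedEmbedder(toy_compose, corpus, cache_size=2)
vectors = [embedder.embed(w) for w in corpus]
```

With `cache_size=2`, only the two most frequent types are stored, so the eight tokens above trigger five calls to the compositional model instead of eight. Raising `cache_size` trades memory for fewer recomputations.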


4 Experiments: Language Modeling 


Our proposed model is similar to models used to compute composed representations of sentences from words (Cho et al., 2014; Li et al., 2015). However, the relationship between the meanings of individual words and the composite meaning of a phrase or sentence is arguably more regular than the relationship between representations of characters and the meaning of a word. Is our model capable of learning such an irregular relationship? We now explore this question empirically.

Language modeling is a task with many applications in NLP. An effective LM requires syntactic aspects of language to be modeled, such as word orderings (e.g., “John is smart” vs. “John smart is”), but also semantic aspects (e.g., “John ate fish” vs. “fish ate John”). Thus, if our C2W model only captures regular aspects of words, such as prefixes and suffixes, the model will yield worse results compared to word lookup tables.


4.1 Language Model 

Language modeling amounts to learning a function that computes the log probability, $\log p(\mathbf{w})$, of a sentence $\mathbf{w} = (w_1, \ldots, w_n)$. This quantity can be decomposed according to the chain rule into the sum of the conditional log probabilities $\sum_{i=1}^{n} \log p(w_i \mid w_1, \ldots, w_{i-1})$. Our language model computes $\log p(w_i \mid w_1, \ldots, w_{i-1})$ by composing representations of words $w_1, \ldots, w_{i-1}$ using a recurrent LSTM model (Mikolov et al., 2010; Sundermeyer et al., 2012).
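The chain-rule decomposition, and the perplexity derived from it, can be sketched in a few lines. The `uniform_model` below is a hypothetical stand-in for the recurrent language model; any function returning a conditional probability for a word given its history could be substituted.

```python
import math

def sentence_log_prob(words, cond_prob):
    """Sum over i of log p(w_i | w_1, ..., w_{i-1})."""
    return sum(math.log(cond_prob(words[:i], words[i])) for i in range(len(words)))

def perplexity(sentences, cond_prob):
    """exp of the negative average per-word log probability."""
    total_log_prob = sum(sentence_log_prob(s, cond_prob) for s in sentences)
    total_words = sum(len(s) for s in sentences)
    return math.exp(-total_log_prob / total_words)

def uniform_model(history, word, vocab_size=4):
    # Toy model: ignores the history and spreads probability evenly over 4 outcomes.
    return 1.0 / vocab_size

sentence = ["cats", "eat", "fish", "</s>"]
log_p = sentence_log_prob(sentence, uniform_model)
ppl = perplexity([sentence], uniform_model)
```

For a uniform model over four outcomes the perplexity equals the number of outcomes, which also illustrates why perplexities are only comparable between models that share an output vocabulary.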


The model is illustrated in Figure 2, where we observe on the first level that each word $w_i$ is projected into its word representation. This can be done by using word lookup tables $e^W_{w_i}$, in which case we will have a regular recurrent language model. To use our C2W model, we can simply replace the word lookup table with the model $f(w_i) = e^C_{w_i}$. Each LSTM block $s_i$ is used to predict word $w_{i+1}$. This is performed by projecting $s_i$ into a vector of the size of the vocabulary $V$ and performing a softmax.




Figure 2: Illustration of our neural network for 
Language Modeling. 


The softmax is still simply a $d \times |V|$ table, which encodes the likelihood of every word type in a given context; it is therefore a closed-vocabulary model. Thus, at test time out-of-vocabulary (OOV) words cannot be addressed. A strategy that is generally applied is to prune the vocabulary $V$ by replacing word types with lower frequencies with an OOV token. At test time, the probability of words not in the vocabulary is estimated as that of the OOV token. Thus, depending on the number of word types that are pruned, the global perplexities may decrease, since there are fewer outcomes in the softmax, which makes the absolute value of perplexity uninformative when comparing models with different vocabulary sizes. Yet, the relative perplexity between different models indicates which models can better predict words based on their contexts.
















































To address OOV words in the baseline setup, these are replaced by an unknown token, which is also associated with a set of embeddings. During training, word types that occur once are replaced with the unknown token stochastically with probability 0.5. The same process is applied at the character level for the C2W model.
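The unknown-token handling can be sketched as below. The token name `<unk>`, the function names and the toy sentences are assumptions for illustration; the same functions apply at the character level if each "sentence" is a list of characters.

```python
import random
from collections import Counter

UNK = "<unk>"  # hypothetical spelling of the unknown token

def replace_singletons(sentences, p=0.5, rng=None):
    """Replace each token whose type occurs once with UNK with probability p."""
    rng = rng or random.Random()
    counts = Counter(w for s in sentences for w in s)
    return [[UNK if counts[w] == 1 and rng.random() < p else w for w in s]
            for s in sentences]

def map_oov(sentence, vocab):
    """At test time, map every token outside the training vocabulary to UNK."""
    return [w if w in vocab else UNK for w in sentence]

train = [["cats", "eat", "fish"], ["cats", "chase", "mice"]]
noisy = replace_singletons(train, p=0.5, rng=random.Random(0))
vocab = {w for s in train for w in s} | {UNK}
test_sentence = map_oov(["dogs", "eat", "fish"], vocab)
```

Here "cats" occurs twice and is never replaced, while each singleton is swapped for the unknown token about half the time, so the model sees training examples for the token it will meet at test time.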

4.2 Experiments 

Datasets. We look at the language model performance on English, Portuguese, Catalan, German and Turkish, which have a broad range of morphological typologies. While all these languages contain inflections, in agglutinative languages affixes tend to be unchanged, while in fusional languages they are not. For each language, Wikipedia articles were randomly extracted until 1 million words were obtained, and these were used for training. For development and testing, we extracted an additional set of 20,000 words.

Setup. We set the size of the word representation $d$ to 50. The C2W model requires setting the dimensionality of characters $d_C$ and current states $d_{CS}$. We set $d_C = 50$ and $d_{CS} = 150$. Each LSTM state used in the language model sequence $s_i$ is set to 150 for both states and cell memories. Training is performed with mini-batch gradient descent with 100 sentences per batch. The learning rate and momentum were set to 0.2 and 0.95, respectively. The softmax over words is always performed on lowercased words. We restrict the output vocabulary to the most frequent 5000 words. The remaining word types are replaced by an unknown token, which must also be predicted. The word representation layer is still computed over all word types (i.e., completely open vocabulary). When using word lookup tables, the input words are also lowercased, as this setup produces the best results. In the C2W model, case information is preserved.
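As an illustration of what the learning rate and momentum settings control, here is a sketch of one common formulation of gradient descent with momentum. The text does not say which momentum variant was used, so the classical update below is an assumption, and the quadratic objective is a toy chosen only so that the behaviour is easy to check.

```python
def momentum_step(params, grads, velocity, lr=0.2, momentum=0.95):
    """Classical momentum: v <- momentum * v - lr * g, then p <- p + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# Toy objective 0.5 * x^2 per coordinate, whose gradient is x itself.
params, velocity = [1.0, -2.0], [0.0, 0.0]
for _ in range(500):
    grads = list(params)
    params, velocity = momentum_step(params, grads, velocity)
```

In mini-batch training, `grads` would be the gradient averaged or summed over the sentences in the batch rather than the gradient of this toy objective.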

Evaluation is performed by computing the perplexities over the test data, and the parameters that yield the lowest perplexity over the development data are used.

Perplexities. Perplexities over the test set are reported in Table 1. These results show that C2W outperforms word lookup tables (row “Word”) in every language, and that improvements are especially pronounced in Turkish, which is a morphologically rich language, where word meanings differ radically depending on the suffixes used (evde → in the house vs. evden → from the house).

| | EN | PT | CA | DE | TR |
|---|---|---|---|---|---|
| Typology | Fusional | Fusional | Fusional | Fusional | Agglutinative |
| Perplexity: 5-gram KN | 70.72 | 58.73 | 39.83 | 59.07 | 52.87 |
| Perplexity: Word | 59.38 | 46.17 | 35.34 | 43.02 | 44.01 |
| Perplexity: C2W | 57.39 | 40.92 | 34.92 | 41.94 | 32.88 |
| #Parameters: Word | 4.3M | 4.2M | 4.3M | 6.3M | 5.7M |
| #Parameters: C2W | 180K | 178K | 182K | 183K | 174K |

Table 1: Language Modeling Results

Number of Parameters. As for the number of parameters (block “#Parameters”), the number of parameters in word lookup tables is $|V| \times d$. If a language contains 80,000 word types (a conservative estimate in morphologically rich languages), 4 million parameters would be necessary. On the other hand, the compositional model consists of 8 matrices of dimensions $d_{CS} \times d_C + 2 d_{CS}$. Additionally, there is also the matrix that combines the forward and backward states, of size $d \times 2 d_{CS}$. Thus, the number of parameters is roughly 150,000, which is substantially fewer. This model also needs a character lookup table with $d_C$ parameters for each entry. For English, there are 618 characters, for an additional 30,900 parameters. So the total number of parameters for English is roughly 180,000 (2 to 3 parameters per word type), which is an order of magnitude lower than word lookup tables.
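The arithmetic in this paragraph can be checked with a short sketch. All inputs are the figures quoted in the text; the size of the compositional model is taken as the stated rough figure of 150,000 rather than recomputed from the matrix dimensions.

```python
d = 50                        # word representation size
d_c = 50                      # character representation size
vocab_size = 80_000           # word types, the estimate quoted in the text
num_chars = 618               # English character types quoted in the text
composition_params = 150_000  # rough figure from the text, taken as given

lookup_params = vocab_size * d                # word lookup table: |V| x d
char_table_params = num_chars * d_c           # character lookup table
c2w_params = composition_params + char_table_params
params_per_word_type = c2w_params / vocab_size
ratio = lookup_params / c2w_params            # how many times smaller C2W is
```

The word lookup table comes to 4,000,000 parameters against 180,900 for C2W, i.e. a little over 2 parameters per word type and a reduction of more than 20 times.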

Performance. As for efficiency, both representations can label sentences at a rate of approximately 300 words per second during training. This may seem surprising, since the C2W model requires a composition over characters, but the main bottleneck of the system is the softmax over the vocabulary. Furthermore, caching is used to avoid composing the same word type twice in the same batch. This shows that the C2W model is relatively fast compared to operations such as a softmax.

Representations of (nonce) words. While it is promising that the model is not simply learning lexical features, what is most interesting is that the model can propose embeddings for nonce words, in stark contrast to the situation observed with lookup table models. We show the 5-most-similar in-vocabulary words (measured with cosine similarity) as computed by our character model on two in-vocabulary words and two nonce words.¹ This makes our model generalize significantly better than lookup tables, which generally use unknown tokens for OOV words. Furthermore, this ability to generalize is much more similar to that of human beings, who are able to infer meanings for new words based on their form.

| increased | John | Noahshire | phding |
|---|---|---|---|
| reduced | Richard | Nottinghamshire | mixing |
| improved | George | Bucharest | modelling |
| expected | James | Saxony | styling |
| decreased | Robert | Johannesburg | blaming |
| targeted | Edward | Gloucestershire | christening |

Table 2: Most-similar in-vocabulary words under the C2W model; the two query words on the left are in the training vocabulary, those on the right are nonce (invented) words.
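The nearest-neighbour query behind such a comparison can be sketched as follows. The two-dimensional vectors are invented for illustration and are not the model's embeddings; in practice `query` would be the composed vector of the query word and `toy_vectors` the composed vectors of the in-vocabulary words.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(query_vec, vocab_vectors, k=5):
    """Return the k vocabulary words whose vectors are closest to the query."""
    ranked = sorted(vocab_vectors,
                    key=lambda w: cosine(query_vec, vocab_vectors[w]),
                    reverse=True)
    return ranked[:k]

# Made-up 2-dimensional vectors standing in for composed word representations.
toy_vectors = {
    "mixing": [0.9, 0.1],
    "styling": [0.8, 0.3],
    "Richard": [0.1, 0.9],
    "George": [0.2, 0.8],
}
query = [1.0, 0.2]  # stand-in for the vector of a nonce word
neighbours = most_similar(query, toy_vectors, k=2)
```

With these toy values the two nearest neighbours are "mixing" and "styling", since their vectors point in nearly the same direction as the query.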


5 Experiments: Part-of-speech Tagging 


As a second illustration of the utility of our model, we turn to POS tagging. As morphology is a strong indicator for syntax in many languages, much effort has been spent engineering features (Nakagawa et al., 2001; Mueller et al., 2013). We now show that some of these features can be learnt automatically using our model.


5.1 Bi-LSTM Tagging Model

Our tagging model is likewise novel, but very straightforward. It builds a Bi-LSTM over words, as illustrated in Figure 3. The input of the model is a sequence of features $f(w_1), \ldots, f(w_n)$. Once again, word vectors can be generated using either the C2W model, $f(w_i) = e^C_{w_i}$, or word lookup tables, $f(w_i) = e^W_{w_i}$. We also test the usage of hand-engineered features, in which case each word is represented by the features $f_1(w_i), \ldots, f_n(w_i)$. Then, the sequential features $f(w_1), \ldots, f(w_n)$ are fed into a bidirectional LSTM model, obtaining the forward states $s^f_0, \ldots, s^f_n$ and the backward states $s^b_{n+1}, \ldots, s^b_0$. Thus, state $s^f_i$ contains the information of all words from 0 to $i$ and $s^b_i$ from $n$ to $i$. The forward and backward states are combined, for each index from 1 to $n$, as follows:

$$l_i = \tanh(L^f s^f_i + L^b s^b_i + b_l),$$

where $L^f$, $L^b$ and $b_l$ are parameters defining how the forward and backward states are combined. The sizes of the forward states $s^f$, the backward states $s^b$ and the combined state $l$ are hyperparameters of the model, denoted as $d^f_{WS}$, $d^b_{WS}$ and $d_{WS}$, respectively. Finally, the output labels for index $i$ are obtained as a softmax over the POS tagset, by projecting the combined state $l_i$.

Figure 3: Illustration of our neural network for POS tagging.

¹ Software submitted as supplementary material.

5.2 Experiments

Datasets. For English, we conduct experiments on the Wall Street Journal portion of the Penn Treebank dataset (Marcus et al., 1993), using the standard splits (sections 1-18 for train, 19-21 for tuning and 22-24 for testing). We also perform tests on 4 other languages, which we obtained from the CoNLL shared tasks (Martí et al., 2007; Brants et al., 2002; Afonso et al., 2002; Atalay et al., 2003). While the PTB dataset provides standard train, tuning and test splits, there are no tuning sets in the datasets in other languages, so we withdraw the last 100 sentences from the training dataset and use them for tuning.


Setup. The POS model requires two sets of hyperparameters. First, words must be converted into continuous representations, and the same hyperparametrization as in language modeling (Section 4) is used. Second, word representations are combined to encode context. Our POS tagger has three hyperparameters $d^f_{WS}$, $d^b_{WS}$ and $d_{WS}$, which correspond to the sizes of LSTM states, and are all set to 50. As for the learning algorithm, we use the same setup (learning rate, momentum and mini-batch sizes) as used in language modeling.

Once again, we replace OOV words with an unknown token in the setup that uses word lookup tables, and do the same with OOV characters in the C2W model. In setups using pre-trained word embeddings, we consider a word an OOV if it was seen neither in the labelled training data nor in the unlabeled data used for pre-training.



Compositional Model Comparison. A comparison of different recurrent neural networks for the C2W model is presented in Table 3. We used our proposed tagger in all experiments, and results are reported for the English Penn Treebank. Label accuracy on the test set is shown in the column “acc”. The number of parameters in the word composition model is shown in the column “parameters”. Finally, the number of words processed per second at test time is shown in the column “words/sec”.

| | acc | parameters | words/sec |
|---|---|---|---|
| Word Lookup | 96.97 | 2000k | 6K |
| Forward RNN | 95.66 | 17.5k | 4K |
| Backward RNN | 95.52 | 17.5k | 4K |
| Bi-RNN | 95.93 | 40k | 3K |
| Forward LSTM | 97.12 | 80k | 3K |
| Backward LSTM | 97.08 | 80k | 3K |
| Bi-LSTM $d_{CS} = 50$ | 97.22 | 70k | 3K |
| Bi-LSTM | 97.36 | 150k | 2K |

Table 3: POS accuracy results for the English PTB using word representation models.

We observe that approaches using RNNs yield worse results than their LSTM counterparts, with a difference of approximately 2%. This suggests that while regular RNNs can learn shorter character sequence dependencies, they are not ideal for learning longer dependencies. LSTMs, on the other hand, obtain higher results, on par with using word lookup tables (row “Word Lookup”), even when using forward (row “Forward LSTM”) and backward (row “Backward LSTM”) LSTMs individually. The best results are obtained using the bidirectional LSTM (row “Bi-LSTM”), which achieves an accuracy of 97.29% on the test set, surpassing the word lookup table.

There are approximately 40k lowercased word types in the training data in the PTB dataset. Thus, a word lookup table with 50 dimensions per type contains approximately 2 million parameters. In the C2W models, the number of character types (including uppercase and lowercase) is approximately 80. Thus, the character lookup table consists of only 4k parameters, which is negligible compared to the number of parameters in the compositional model, which is once again 150k parameters. One could argue that results in the Bi-LSTM model are higher than those achieved by other models because it contains more parameters, so we set the state size $d_{CS} = 50$ (row “Bi-LSTM $d_{CS} = 50$”) and obtained similar results.

In terms of computational speed, we can observe that there is a more significant slowdown when applying the C2W models compared to language modeling. This is because there is no longer a softmax over the whole word vocabulary as the main bottleneck of the network. However, we can observe that while the Bi-LSTM system is 3 times slower, it does not significantly hurt the performance of the system.

| System | EN | PT | CA | DE | TR |
|---|---|---|---|---|---|
| Typology | Fusional | Fusional | Fusional | Fusional | Agglutinative |
| Word | 96.97 | 95.67 | 98.09 | 97.51 | 83.43 |
| C2W | 97.36 | 97.47 | 98.92 | 98.08 | 91.59 |
| Stanford | 97.32 | 97.54 | 98.76 | 97.92 | 87.31 |

Table 4: POS accuracies on different languages


Results on Multiple Languages. Results on 5 languages are shown in Table 4. In general, we can observe that the model using word lookup tables (row “Word”) performs consistently worse than the C2W model (row “C2W”). We also compare our results with Stanford’s POS tagger, with the default set of features (row “Stanford” in Table 4). Results using this tagger are comparable to or better than state-of-the-art systems. We can observe that in most cases we can slightly outperform the scores obtained using their tagger. This is a promising result, considering that we use the same training data and do not handcraft any features. Furthermore, we can observe that for Turkish, our results are significantly higher (>4%).


Comparison with Benchmarks. Most state-of-the-art POS tagging systems are obtained by either learning or handcrafting good lexical features (Manning, 2011; Sun, 2014) or using additional raw data to learn features in an unsupervised fashion. Generally, optimal results are obtained by performing both. Table 5 shows the current benchmarks in this task for the English PTB. Accuracies on the test set are reported in column “acc”. Columns “feat” and “data” define whether hand-crafted features are used and whether additional data was used. We can see that even without feature engineering or unsupervised pretraining, our C2W model (row “C2W”) is on par with the current state-of-the-art system (row “structReg”). However, if we add hand-crafted features, we can obtain further improvements on this dataset (row “C2W + features”).

However, there are many words that do not contain morphological cues to their part-of-speech. For instance, the word snake does not contain any morphological cues that determine its tag. In these cases, if they are not found labelled in the training data, the model would be dependent on context to determine their tags, which could lead to errors in ambiguous contexts. Unsupervised training methods such as the Skip-n-gram model (Mikolov et al., 2013) can be used to pretrain the word representations on unannotated corpora. If such pretraining places cat, dog and snake near each other in vector space, and the supervised POS data contains evidence that cat and dog are nouns, our model will be likely to label snake with the same tag.

We train embeddings using English Wikipedia with the dataset used in (Ling et al., 2015), and the Structured Skip-n-gram model. Results using pre-trained word lookup tables and the C2W with the pre-trained word lookup tables as additional parameters are shown in rows “word(sskip)” and “C2W + word(sskip)”. We can observe that both systems can obtain improvements over their random initializations (rows “word” and “C2W”).

Finally, we also found that when using the C2W model in conjunction with pre-trained word embeddings, adding a non-linearity to the representations extracted from the C2W model, $e^C_w$, improves the results over using a simple linear transformation (row “C2W(tanh)+word (sskip)”). This setup obtains 0.28 points over the current state-of-the-art system (row “SCCN”).

A similar model uses a convolutional model to learn additional representations for words (Santos and Zadrozny, 2014) (row “CNN (Santos and Zadrozny, 2014)”). However, results are not directly comparable, as a different set of embeddings is used to initialize the word lookup table.


System                             feat   data   acc
word                               no     no     96.70
C2W                                no     no     97.36
word+features                      yes    no     97.34
C2W+features                       yes    no     97.57
Stanford 2.0 (Manning, 2011)       yes    no     97.32
structReg (Sun, 2014)              yes    no     97.36
word (sskip)                       no     yes    97.42
C2W+word (sskip)                   no     yes    97.54
C2W(tanh)+word (sskip)             no     yes    97.78
CNN (Santos and Zadrozny, 2014)    no     yes    97.32
Morce (Spoustová et al., 2009)     yes    yes    97.44
SCCN (Søgaard, 2011)               yes    yes    97.50

Table 5: POS accuracy result comparison with
state-of-the-art systems for the English PTB.

5.3 Discussion

It is important to note that these results do
not imply that our model always outperforms
existing benchmarks; in fact, in most experiments,
results are fairly similar to those of existing
systems. Even in Turkish, using morphological
analysers to extract additional features could
accomplish similar results. The goal of our
work is not to surpass existing benchmarks, but
to show that much of the feature engineering done
in the benchmarks can be learnt automatically from
the task-specific data. More importantly, we wish
to show that large word lookup tables can be
compacted into a lookup table over characters
and a compositional model, allowing the model to
scale better with the size of the training data.
This is a desirable property of the model as data
becomes more abundant in many NLP tasks.
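
The scaling argument can be made concrete with a rough parameter
count. The sketch below is illustrative only: all sizes are
hypothetical, the function names are introduced here, and the LSTM
count assumes a standard cell without peephole connections, which
may differ from the cell used in the experiments.

```python
def lookup_table_params(vocab_size: int, dim: int) -> int:
    """Word lookup table: one dim-sized vector per word type."""
    return vocab_size * dim


def lstm_params(input_dim: int, hidden_dim: int) -> int:
    """One LSTM direction (assumed: standard cell, no peepholes).

    Three gates plus the candidate cell, each with input weights,
    recurrent weights and a bias vector.
    """
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


def c2w_params(num_chars: int, char_dim: int,
               hidden_dim: int, word_dim: int) -> int:
    """Character table + forward/backward LSTMs + linear projection."""
    char_table = num_chars * char_dim
    bilstm = 2 * lstm_params(char_dim, hidden_dim)
    # One projection matrix per direction, plus a bias vector.
    projection = 2 * word_dim * hidden_dim + word_dim
    return char_table + bilstm + projection


# Hypothetical sizes, chosen only for illustration.
WORD_DIM, CHAR_DIM, HIDDEN_DIM, NUM_CHARS = 50, 50, 150, 200

c2w_total = c2w_params(NUM_CHARS, CHAR_DIM, HIDDEN_DIM, WORD_DIM)
for vocab_size in (10_000, 100_000, 1_000_000):
    table = lookup_table_params(vocab_size, WORD_DIM)
    print(f"vocab={vocab_size:>9,}  lookup={table:>11,}  c2w={c2w_total:,}")
```

Under these assumed sizes, the word lookup table grows linearly with
the vocabulary, whereas the character-based count stays fixed as
long as the character set does not change.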

6 Related Work 

To our knowledge, learning word representations
without relying on word lookup tables has not been
explored before. In essence, our model
attempts to learn lexical features automatically
while compacting the model by reducing the
redundancy found in word lookup tables.
Individually, these problems have been the focus of
research in many areas.

Lexical information has been used to augment
word lookup tables. Word representation learning
can be thought of as a process that takes a
string representing a word as input and outputs
a set of values that represent the word in vector
space. Using word lookup tables is one possible
approach to accomplish this. Many methods
have been used to augment this model to
learn lexical features with an additional model
that is jointly trained with the word lookup
table. This is generally accomplished by either
performing a component-wise addition of the
embeddings produced by the word lookup table and
those generated by the additional lexical model
(Chen et al., 2015), or simply concatenating both
representations (Santos and Zadrozny, 2014). Many
models have been proposed. The work in (Collobert
et al., 2011) notes that additional feature
sets F_i can be added to the one-hot representation
and multiple lookup tables can be learnt
to project each of the feature sets to the same
low-dimensional vector e^W_w.
For instance, the work in (Botha and Blunsom, 2014)
shows that morphological features generated by
morphological analyzers, such as stems, prefixes and
suffixes, can be used to learn better representations
for words. A problem with this approach is
that the model can only learn from what has
been defined as feature sets. The models proposed
in (Santos and Zadrozny, 2014; Chen et al., 2015)
allow the model to extract arbitrary meaningful
lexical features from words by defining
compositional models over characters. The work in
(Chen et al., 2015) defines a simple compositional
model by summing over all characters in a given word,
while the work in (Santos and Zadrozny, 2014)
defines a convolutional network, which combines
windows of characters and a max-pooling layer to
find important morphological features. The main
drawback of these methods is that character
order is often neglected: when summing over
all character embeddings, words such as dog and
god would have the same representation according
to the lexical model. Convolutional models are
less susceptible to this problem, as they
combine windows of characters at each convolution,
where the order within the window is preserved.
However, the order between extracted windows is
not, so the problem still persists for longer words,
such as those found in agglutinative languages.
Still, these approaches work in conjunction with a
word lookup table, which compensates for this
limitation. Aside from neural approaches,
character-based models have been applied to address
multiple lexically oriented tasks, such as
transliteration (Kang and Choi, 2000) and Twitter
normalization (Xu et al., 2013; Ling et al., 2013).
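
The effect of character order on these compositions can be
illustrated with a toy sketch. Everything in it is an assumption
made for illustration: the parameters are random rather than
learnt, the convolution uses windows of two characters without
padding, and a single tanh recurrence stands in for an
order-sensitive model such as a bidirectional LSTM. It is not the
implementation of any of the cited systems.

```python
import math
import random

DIM, WIDTH = 4, 2
rng = random.Random(0)


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# Hypothetical parameters; a trained model would learn these.
CHAR_EMB = {c: rand_vec(DIM) for c in "abcdefghijklmnopqrstuvwxyz"}
W_REC = [rand_vec(DIM) for _ in range(DIM)]
FILTERS = [rand_vec(DIM * WIDTH) for _ in range(DIM)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def compose_sum(word):
    """Add up character vectors; character order is ignored."""
    return [sum(CHAR_EMB[ch][i] for ch in word) for i in range(DIM)]


def compose_conv_max(word):
    """Apply each filter to every window of WIDTH characters, then max-pool.

    Order inside a window matters; the order of the windows does not.
    Assumes len(word) >= WIDTH and no padding.
    """
    windows = [sum((CHAR_EMB[ch] for ch in word[i:i + WIDTH]), [])
               for i in range(len(word) - WIDTH + 1)]
    return [max(dot(f, win) for win in windows) for f in FILTERS]


def compose_recurrent(word):
    """Read characters left to right with a tanh recurrence."""
    h = [0.0] * DIM
    for ch in word:
        h = [math.tanh(dot(row, h) + x)
             for row, x in zip(W_REC, CHAR_EMB[ch])]
    return h


def close(u, v, tol=1e-9):
    """True if two vectors agree up to floating-point rounding."""
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(u, v))


# "dog" and "god" contain the same characters in a different order.
print("sum  dog/god    :", close(compose_sum("dog"), compose_sum("god")))
print("conv dog/god    :", close(compose_conv_max("dog"), compose_conv_max("god")))
# "abcab" and "bcabc" yield the same set of windows {ab, bc, ca}.
print("conv abcab/bcabc:", close(compose_conv_max("abcab"), compose_conv_max("bcabc")))
print("rec  abcab/bcabc:", close(compose_recurrent("abcab"), compose_recurrent("bcabc")))
```

The summed representations of dog and god coincide by construction,
and the max-pooled convolution cannot separate abcab from bcabc
because both words produce the same set of windows. The recurrent
composition reads the characters in sequence, so it is expected to
assign different vectors in both cases.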


Compacting models has been a focus of research
in tasks such as language modeling and
machine translation, as extremely large models
can be built with the large amounts of training
data that are available in these tasks. In language
modeling, it is common to prune higher-order
n-grams that do not encode any additional
information (Seymore and Rosenfeld, 1996; Stolcke,
1998; Moore and Quirk, 2009). The same can be
applied in machine translation (Ling et al., 2012;
Zens et al., 2012) by removing longer translation
pairs that can be replicated using smaller ones. In
essence, our model learns regularities at the
subword level that can be leveraged for building more
compact word representations.

Finally, our work has been applied to dependency
parsing, where similar improvements over word
models were found in morphologically rich
languages (Ballesteros et al., 2015).


7 Conclusion


We propose a C2W model that builds word
embeddings for words without an explicit word
lookup table. Thus, it benefits from being
sensitive to lexical aspects within words, as it takes
characters as atomic units to derive the
embeddings for the word. On POS tagging, our
models using characters alone can still achieve
comparable or better results than state-of-the-art
systems, without the need to manually engineer such
lexical features. Although language modeling
and POS tagging both benefit strongly from
morphological cues, the success of our models in
languages with impoverished morphological cues
shows that they are able to learn non-compositional
aspects of how letters fit together.

The code for the C2W model and our language
model and POS tagger implementations is
available from https://github.com/wlin12/JNN.